Goto

Collaborating Authors

 data security


A Survey on Data Security in Large Language Models

Chen, Kang, Zhou, Xiuze, Lin, Yuanguo, Su, Jinhe, Yu, Yuanhui, Shen, Li, Lin, Fan

arXiv.org Artificial Intelligence

Large Language Models (LLMs), now a foundation in advancing natural language processing, power applications such as text generation, machine translation, and conversational systems. Despite their transformative potential, these models inherently rely on massive amounts of training data, often collected from diverse and uncurated sources, which exposes them to serious data security risks. Harmful or malicious data can compromise model behavior, leading to issues such as toxic output, hallucinations, and vulnerabilities to threats such as prompt injection or data poisoning. As LLMs continue to be integrated into critical real-world systems, understanding and addressing these data-centric security risks is imperative to safeguard user trust and system reliability. This survey offers a comprehensive overview of the main data security risks facing LLMs and reviews current defense strategies, including adversarial training, RLHF, and data augmentation. Additionally, we categorize and analyze relevant datasets used for assessing robustness and security across different domains, providing guidance for future research. Finally, we highlight key research directions that focus on secure model updates, explainability-driven defenses, and effective governance frameworks, aiming to promote the safe and responsible development of LLM technology. This work aims to inform researchers, practitioners, and policymakers, driving progress toward data security in LLMs.


AI dashcams enhance trucker safety while raising privacy concerns

FOX News

There are concerns about how the technology might affect personal space and data security. The trucking industry is in the midst of a technological revolution, thanks to the arrival of artificial intelligence-powered dashcams. These innovative devices promise to make roads safer and operations more efficient, but they also raise some important questions about privacy. For truck drivers, other motorists and even pedestrians, there are valid concerns about how this technology might affect their personal space and data security. GET SECURITY ALERTS & EXPERT TECH TIPS -- SIGN UP FOR KURT'S THE CYBERGUY REPORT NOW AI dashcams are transforming road safety and fleet management through advanced computer vision technology.


SecBench: A Comprehensive Multi-Dimensional Benchmarking Dataset for LLMs in Cybersecurity

Jing, Pengfei, Tang, Mengyun, Shi, Xiaorong, Zheng, Xing, Nie, Sen, Wu, Shi, Yang, Yong, Luo, Xiapu

arXiv.org Artificial Intelligence

Evaluating Large Language Models (LLMs) is crucial for understanding their capabilities and limitations across various applications, including natural language processing and code generation. Existing benchmarks like MMLU, C-Eval, and HumanEval assess general LLM performance but lack focus on specific expert domains such as cybersecurity. Previous attempts to create cybersecurity datasets have faced limitations, including insufficient data volume and a reliance on multiple-choice questions (MCQs). To address these gaps, we propose SecBench, a multi-dimensional benchmarking dataset designed to evaluate LLMs in the cybersecurity domain. SecBench includes questions in various formats (MCQs and short-answer questions (SAQs)), at different capability levels (Knowledge Retention and Logical Reasoning), in multiple languages (Chinese and English), and across various sub-domains. The dataset was constructed by collecting high-quality data from open sources and organizing a Cybersecurity Question Design Contest, resulting in 44,823 MCQs and 3,087 SAQs. Particularly, we used the powerful while cost-effective LLMs to (1). label the data and (2). constructing a grading agent for automatic evaluation of SAQs. Benchmarking results on 16 SOTA LLMs demonstrate the usability of SecBench, which is arguably the largest and most comprehensive benchmark dataset for LLMs in cybersecurity. More information about SecBench can be found at our website, and the dataset can be accessed via the artifact link.


Trust and Dependability in Blockchain & AI Based MedIoT Applications: Research Challenges and Future Directions

Solaiman, Ellis, Awad, Christa

arXiv.org Artificial Intelligence

This paper critically reviews the integration of Artificial Intelligence (AI) and blockchain technologies in the context of Medical Internet of Things (MedIoT) applications, where they collectively promise to revolutionize healthcare delivery. By examining current research, we underscore AI's potential in advancing diagnostics and patient care, alongside blockchain's capacity to bolster data security and patient privacy. We focus particularly on the imperative to cultivate trust and ensure reliability within these systems. Our review highlights innovative solutions for managing healthcare data and challenges such as ensuring scalability, maintaining privacy, and promoting ethical practices within the MedIoT domain. We present a vision for integrating AI-driven insights with blockchain security in healthcare, offering a comprehensive review of current research and future directions. We conclude with a set of identified research gaps and propose that addressing these is crucial for achieving the dependable, secure, and patient -centric MedIoT applications of tomorrow.


Ensuring superior learning outcomes and data security for authorized learner

Bang, Jeongho, Song, Wooyeong, Shin, Kyujin, Kim, Yong-Su

arXiv.org Machine Learning

The learner's ability to generate a hypothesis that closely approximates the target function is crucial in machine learning. Achieving this requires sufficient data; however, unauthorized access by an eavesdropping learner can lead to security risks. Thus, it is important to ensure the performance of the "authorized" learner by limiting the quality of the training data accessible to eavesdroppers. Unlike previous studies focusing on encryption or access controls, we provide a theorem to ensure superior learning outcomes exclusively for the authorized learner with quantum label encoding. In this context, we use the probably-approximately-correct (PAC) learning framework and introduce the concept of learning probability to quantitatively assess learner performance. Our theorem allows the condition that, given a training dataset, an authorized learner is guaranteed to achieve a certain quality of learning outcome, while eavesdroppers are not. Notably, this condition can be constructed based only on the authorized-learning-only measurable quantities of the training data, i.e., its size and noise degree. We validate our theoretical proofs and predictions through convolutional neural networks (CNNs) image classification learning.


MAIDS: Malicious Agent Identification-based Data Security Model for Cloud Environments

Gupta, Kishu, Saxena, Deepika, Gupta, Rishabh, Singh, Ashutosh Kumar

arXiv.org Artificial Intelligence

With the vigorous development of cloud computing, most organizations have shifted their data and applications to the cloud environment for storage, computation, and sharing purposes. During storage and data sharing across the participating entities, a malicious agent may gain access to outsourced data from the cloud environment. A malicious agent is an entity that deliberately breaches the data. This information accessed might be misused or revealed to unauthorized parties. Therefore, data protection and prediction of malicious agents have become a demanding task that needs to be addressed appropriately. To deal with this crucial and challenging issue, this paper presents a Malicious Agent Identification-based Data Security (MAIDS) Model which utilizes XGBoost machine learning classification algorithm for securing data allocation and communication among different participating entities in the cloud system. The proposed model explores and computes intended multiple security parameters associated with online data communication or transactions. Correspondingly, a security-focused knowledge database is produced for developing the XGBoost Classifier-based Malicious Agent Prediction (XC-MAP) unit. Unlike the existing approaches, which only identify malicious agents after data leaks, MAIDS proactively identifies malicious agents by examining their eligibility for respective data access. In this way, the model provides a comprehensive solution to safeguard crucial data from both intentional and non-intentional breaches, by granting data to authorized agents only by evaluating the agents behavior and predicting the malicious agent before granting data.


Block MedCare: Advancing healthcare through blockchain integration with AI and IoT

Simonoski, Oliver, Bogatinoska, Dijana Capeska

arXiv.org Artificial Intelligence

This research explores the integration of blockchain technology in healthcare, focusing on enhancing the security and efficiency of Electronic Health Record (EHR) management. We propose a novel Ethereum-based system that empowers patients with secure control over their medical data. Our approach addresses key challenges in healthcare blockchain implementation, including scalability, privacy, and regulatory compliance. The system incorporates digital signatures, Role-Based Access Control, and a multi-layered architecture to ensure secure, controlled access. We developed a decentralized application (dApp) with user-friendly interfaces for patients, doctors, and administrators, demonstrating the practical application of our solution. A survey among healthcare professionals and IT experts revealed strong interest in blockchain adoption, while also highlighting concerns about integration costs. The study explores future enhancements, including integration with IoT devices and AI-driven analytics, contributing to the evolution of secure, efficient, and interoperable healthcare systems that leverage cutting-edge technologies for improved patient care.


Redefining Data-Centric Design: A New Approach with a Domain Model and Core Data Ontology for Computational Systems

Johnson, William, Davis, James, Kelly, Tara

arXiv.org Artificial Intelligence

Before this, fragmented computer networks struggled to communicate seamlessly. The introduction of the Transmission Control Protocol/Internet Protocol (TCP/IP) enabled consistent data transfer and became the standard for digital communication. However, this node-centric approach, which relies heavily on Internet Protocol (IP) addresses, has also created significant security vulnerabilities and privacy concerns due to its focus on network nodes rather than the data itself. In today's digital landscape, the centralized aggregation and storage of sensitive user data -- including IP addresses -- by service providers pose substantial security risks. These centralized repositories are prime targets for cyberattacks, potentially compromising user privacy and exposing sensitive information. Additionally, the reliance on IP-based system modeling has amplified these risks, necessitating a shift toward a more secure and resilient design approach. This paper proposes a novel data-centric design methodology that moves away from traditional node-focused models. By prioritizing data as the central entity and incorporating multimodal frameworks encompassing objects, events, concepts, and actions, this approach enhances data security and flexibility. The new informatics domain model reimagines data's role in system design, emphasizing its importance throughout its entire lifecycle to foster innovation, security, and seamless data interoperability.


PristiQ: A Co-Design Framework for Preserving Data Security of Quantum Learning in the Cloud

Wang, Zhepeng, Sheng, Yi, Koirala, Nirajan, Basu, Kanad, Jung, Taeho, Lu, Cheng-Chang, Jiang, Weiwen

arXiv.org Artificial Intelligence

Benefiting from cloud computing, today's early-stage quantum computers can be remotely accessed via the cloud services, known as Quantum-as-a-Service (QaaS). However, it poses a high risk of data leakage in quantum machine learning (QML). To run a QML model with QaaS, users need to locally compile their quantum circuits including the subcircuit of data encoding first and then send the compiled circuit to the QaaS provider for execution. If the QaaS provider is untrustworthy, the subcircuit to encode the raw data can be easily stolen. Therefore, we propose a co-design framework for preserving the data security of QML with the QaaS paradigm, namely PristiQ. By introducing an encryption subcircuit with extra secure qubits associated with a user-defined security key, the security of data can be greatly enhanced. And an automatic search algorithm is proposed to optimize the model to maintain its performance on the encrypted quantum data. Experimental results on simulation and the actual IBM quantum computer both prove the ability of PristiQ to provide high security for the quantum data while maintaining the model performance in QML.


Artificial Intelligence enhanced Security Problems in Real-Time Scenario using Blowfish Algorithm

Chinnam, Yuvaraju, Sambana, Bosubabu

arXiv.org Artificial Intelligence

In a nutshell, "the cloud" refers to a collection of interconnected computing resources made possible by an extensive, real-time communication network like the internet. Because of its potential to reduce processing costs, the emerging paradigm of cloud computing has recently attracted a large number of academics. The exponential expansion of cloud computing has made the rapid expansion of cloud services very remarkable. Ensuring the security of personal information in today's interconnected world is no easy task. These days, security is really crucial. Models of security that are relevant to cloud computing include confidentiality, authenticity, accessibility, data integrity, and recovery. Using the Hybrid Encryption this study, we cover all the security issues and leaks in cloud infrastructure.